Abstract —We present a computational framework for automatically quantifying verbal and nonverbal behaviors in the context of 
job interviews. The proposed framework is trained by analyzing the videos of 138 interview sessions with 69 internship-seeking 
undergraduates at the Massachusetts Institute of Technology (MIT). Our automated analysis includes facial expressions (e.g., smiles, 
head gestures, facial tracking points), language (e.g., word counts, topic modeling), and prosodic information (e.g., pitch, intonation, 
and pauses) of the interviewees. The ground truth labels are derived by taking a weighted average over the ratings of 9 independent 
judges. Our framework can automatically predict the ratings for interview traits such as excitement, friendliness, and engagement with 
correlation coefficients of 0.75 or higher, and can quantify the relative importance of prosody, language, and facial expressions. By 
analyzing the relative feature weights learned by the regression models, our framework recommends speaking more fluently, using fewer 
filler words, speaking as “we” (vs. “I”), using more unique words, and smiling more. We also find that the students who were rated highly while 
answering the first interview question were also rated highly overall (i.e., first impression matters). Finally, our MIT Interview dataset 
will be made available to other researchers to further validate and expand our findings. 

Index Terms —Nonverbal Behavior Prediction, Job Interviews, Multimodal Interactions, Regression. 
1 Introduction 

Analysis of non-verbal behavior to predict the outcome 
of a social interaction has been studied for many years 
in different domains, with predictions ranging from 
marriage stability based on interactions between newlywed 
couples [1], [2], to patient satisfaction based on 
doctor-patient interaction [3], to teacher evaluation by 
analyzing classroom interactions between a teacher and 
the students [4]. However, many of these prediction 
frameworks were based on manually labeled behavioral 
patterns by trained coders, according to carefully designed 
coding schemes. Manual labeling of nonverbal 
behaviors is laborious and time consuming, and therefore 
often does not scale with large amounts of data. 
As a scalable alternative, several automated prediction 
frameworks have been proposed based on low-level 
behavioral features, automatically extracted from larger 
datasets. Due to the challenges of collecting and analyzing 
multimodal data, most of these automated methods 
focused on a single modality of interaction [5], [6], [7], 
[8]. In this paper, we address the challenge of automated 
understanding of multimodal human interactions, 
including facial expression, prosody, and language. We 
focus on predicting social interactions in the context of 
job interviews for college students, which is an exciting 
and relatively less explored domain. 

Job interviews are ubiquitous and play an important 
role in our lives and careers. Over many years, social 
psychologists and career coaches have accumulated 
knowledge and guidelines for success in job 
interviews [9], [10], [11]. Studies in social psychology 
have shown that smiling, using a louder voice, and 
maintaining eye contact contribute positively to our 
interpersonal communications [9], [11]. These guidelines 
are largely based on intuition, experience, and studies 
involving manual encoding of nonverbal behaviors on 
a limited amount of data [9]. Automated data-driven 
quantification of both verbal and non-verbal behaviors 
simultaneously has not been explored in the context of 
job interviews. In this paper, we aim to quantify the 
determinants of a successful job interview using a computational 
prediction framework based on automatically 
extracted features, which takes both verbal speech and 
non-verbal behaviors into account. 

Fig. 1. Framework of Analysis. Mechanical Turk workers 
rated interviewee performance by watching videos of job 
interviews. Various features were extracted from those 
videos. A framework was built to predict the Turkers' ratings 
and to gain insight into the characteristics of a good 
interview. 

Imagine the following scenario in which two students, 
John and Matt, were individually asked to discuss their 
leadership skills in a job interview. John responded with 
the following: 

"One semester ago, I was part of a team of ten 
students [stated in a loud and clear voice]. We 
worked together to build an autonomous playing 
robot. I led the team by showing how to program the 
robot. The students did a wonderful job [conveyed 
excitement with tone]! In ten weeks, we made the 
robot play soccer. It was a lot of fun. [concluded 
with a smile]". 

Matt responded with the following: 

"Umm ... [paused for 2 seconds] Last semester I led 
a group in a class project on robot programming. 
It was a totally crazy experience. The students did 
almost nothing until the last moment. ... Umm 
... Basically, I had to intervene at that point and 
led them to work hard. Eventually, this project 
was completed successfully. [looked away from the 
interviewer]". 

Who do you think received higher ratings? 

Most would agree that the first interviewee, John, 
provided the more enthusiastic and engaging answer. 
We can easily interpret the meaning of our verbal 
and nonverbal behavior during face-to-face interactions. 
However, we often cannot quantify how the combination 
of these behaviors affects our interpersonal communications. 
Previous research [12] shows that the style 
of speaking, prosody, facial expression, and language 
reflect valuable information about one's personality and 
mental states. Understanding the relative influence of 
these individual modalities can provide crucial insight 
regarding job interviews. 

In this paper, we attempt to answer the following 
research questions by analyzing the audio-visual recordings 
of 138 interview sessions with 69 individuals: 

• Can we automatically quantify verbal and nonverbal 
behavior, and assess their role in the overall 
rating of job interviews? 

• Can we build a computational framework that can 
automatically predict the overall rating of a job 
interview given the audio-visual recordings? 

• Can we infer the relative importance of language, 
facial expressions, and prosody (intonation)? 

• Can we make automated recommendations on improving 
social traits such as excitement, friendliness, 
and engagement in the context of a job interview? 

To answer these research questions, we designed and 
implemented an automated prediction framework for 
quantifying the ratings of job interviews, given the 
audio-visual recordings. The proposed prediction framework 
(Figure 1) automatically extracts a diverse set of 
multimodal features (lexical, facial, and prosodic), and 
quantifies the overall interview performance, the likelihood 
of getting hired, and 14 other social traits relevant 
to the job interview process. Our system is capable of 
predicting the overall rating of a job interview with a 
correlation coefficient r > 0.65 and AUC = 0.81 (baseline 
0.50) on average. We can also predict different social 
traits such as engagement, excitement, and friendliness 
with even higher accuracy (r > 0.75, AUC > 0.85). 
Furthermore, we investigate the relative weights of the 
individual verbal and non-verbal features learned by 
our regression models, and quantify their relative importance 
in the context of job interviews. Our prediction 
model can be integrated with the existing automated 
interview coaching systems, such as MACH [13], to 
provide more intelligent and quantitative feedback. The 
interview questions asked in our training dataset are 
chosen to be independent of any job specifications or 
skill requirements. Therefore, the ratings predicted by 
our model are based on social and behavioral skills only, 
and they may differ from a hiring manager's opinion, 
given a specific job. 

Parts of the research included in this article have been 
presented in [14]. In this article, we present an improved 
system by including additional facial features and provide 
more comprehensive results and analysis. The remainder 
of the article is organized as follows. In Section 2, we 
discuss the background research on automated quantification 
of multimodal nonverbal behaviors. Section 3 
describes the interview dataset and the data annotation 
process via Mechanical Turk. A detailed discussion of the 
proposed computational framework, feature extraction, 
and automated prediction is presented in Section 4. We 
present our detailed results in Section 5. Finally, we 
conclude with our findings and discuss our future work 
in Section 6. 

2 Background Research 

In this section, we discuss existing relevant work on nonverbal 
behavior prediction using automatically extracted 
features. We particularly focus on the social cues that 
have been shown to be relevant to job interviews and 
face-to-face interactions [9]. We also discuss previous 
research on automated conversational systems for job 
interviews, which is one of the potential applications we 
envision for the proposed prediction framework. 

2.1 Nonverbal Behavior Recognition 

Nonverbal behaviors are subtle, fleeting, subjective, and 
sometimes even contradictory. Even a simple facial expression 
such as a smile can have different meanings, 
e.g., delight, rapport, sarcasm, and even frustration [15]. 
Edward Sapir, in 1927, referred to non-verbal behavior 
as "an elaborate and secret code that is written nowhere, 
known by none, but understood by all" [16]. Despite 
years of research, nonverbal behavior prediction remains 
a challenging problem. Gottman et al. [1], [2] studied 
verbal and non-verbal interactions between newlywed 
couples and developed mathematical models to predict 
marriage stability and chances of divorce. For example, 
they found that the greatest predictor of divorce 
is contempt, which must be avoided for a successful 
marriage. Hall et al. [3] studied the non-verbal cues in 
doctor-patient interaction and showed that doctors who 
are more sensitive to nonverbal skills received higher 
ratings of service during patient visits. Ambady et al. [4] 
studied the interactions of teachers with students in 
a classroom and proposed a framework for predicting 
teachers' evaluations based on short clips of interactions. 
However, these prediction frameworks were based on 
manually labeled behavioral patterns. Manually labeling 
non-verbal behaviors is laborious and time consuming, 
and is often not scalable to large amounts of data. 

To allow for the analysis of larger datasets of social 
interactions, several automated prediction frameworks 
have been proposed. Due to the challenges of collecting 
and analyzing multimodal data, most of the existing 
automated prediction frameworks focus on a single behavioral 
modality, such as prosody [8], [17], [18], facial 
expression [6], gesture [7], and word usage pattern [19]. 
Analysis based on a single modality is likely to overlook 
many critical non-verbal behaviors, and hence there has 
been a growing interest in analyzing social behaviors in 
more than a single modality. 

Ranganath et al. [20], [21] studied social interactions 
in speed-dates using a combination of prosodic and linguistic 
features. The analysis is based on the SpeedDate 
corpus, a spoken corpus of approximately 1000 4-minute 
speed-dates, where each participant rated his/her date in 
terms of four different conversational styles (awkwardness, 
assertiveness, flirtatiousness, and friendliness) on a 
ten point Likert scale. Given the speech data, Ranganath 
et al. proposed a computational framework for predicting 
these four conversational styles using prosodic and 
linguistic features only, while ignoring facial expressions. 
Stark et al. [22] were able to reliably predict the nature 
of a telephone conversation (business versus personal, 
familiar versus unfamiliar) using the lexical and prosodic 
features extracted from as few as 30 words of speech 
at the beginning of the conversation. Kapoor et al. [12] 
and Pianesi et al. [23] proposed systems to recognize 
different social and personality traits by exploiting only 
prosody and visual features. Sanchez et al. [24] proposed 
a system for predicting eleven different social moods 
(e.g., surprise, anger, happiness) from YouTube video 
monologues, which involve different social dynamics 
than face-to-face interactions. 

Perhaps the most relevant, Nguyen et al. [25] proposed 
a computational framework to predict the hiring 
decision using nonverbal behavioral cues extracted 
from a dataset of 62 interview videos. Nguyen et al. 
considered only nonverbal cues, and did not include 
verbal content in their analysis. Our work extends the 
current state-of-the-art and generates new knowledge 
by incorporating three different modalities (prosody, 
language, and facial expressions), and fifteen different 
social traits (e.g., friendliness, excitement, engagement), 
and quantifies the interplay and relative influences of 
these different modalities for each of the different social 
traits. Furthermore, by analyzing the relative feature 
weights learned by our regression models, we obtain 
valuable insights about behaviors that are recommended 
for success in job interviews (Section 5.2.3). 

2.2 Social Coaching for Job Interviews 

Several automated systems have been proposed for 
coaching the necessary social skills to succeed in job 
interviews [13], [26], [27]. Hoque et al. [13] developed 
MACH (My Automated Conversation coacH), which 
allows users to improve social skills by interacting with 
a virtual agent. The MACH system records videos of the 
user using a webcam and a microphone, and provides 
feedback regarding several low level behavioral patterns, 
e.g., average smile intensity, pause duration, speaking 
rate, pitch variation, etc. 

Anderson et al. [26] proposed an interview coaching 
system, TARDIS, which presents the training interactions 
as a scenario-based "serious game". The TARDIS 
framework incorporates a sub-module named NovA 
(NonVerbal behavior Analyzer) [27] that can recognize 
several lower level social cues: hands-to-face, looking away, 
postures, leaning forward/backward, gesticulation, voice activity, 
smiles, and laughter. Using videos that are manually 
annotated with these ground truth social cues, NovA 
trains a Bayesian Network that can infer higher-level 
mental traits (e.g., stressed, focused, engaged, etc.). Automated 
prediction of higher-level traits remains part of 
their future work. 

Our framework (1) quantifies the relative influences 
of different low level features on the interview outcome, 
(2) learns regression models to predict interview ratings 
and the likelihood of hiring using automatically extracted 
features, and (3) predicts several other high-level 
personality traits such as engagement, friendliness, and 
excitement. One of our objectives is to extend the existing 
automated conversation systems by providing feedback 
on the overall interview performance and additional 
high-level personality traits. 


3 Dataset Description 

We used the MIT Interview Dataset [13], which consists 
of 138 audio-visual recordings of mock interviews with 
internship-seeking students from Massachusetts Institute 
of Technology (MIT). The total duration of our interview 
videos is nearly 10.5 hours (on average, 4.7 minutes 
per interview, for 138 interview videos). To our knowledge, 
this is the largest collection of interview videos 
conducted by professional counselors under realistic settings. 
The following sections provide a brief description 
of the data collection and ground truth labeling. 
Fig. 2. The experimental setup for collecting audio-visual 
recordings of the mock interviews. Camera #1 recorded 
the video and audio of the interviewee, while Camera #2 
recorded the interviewer. 

3.1 Data Collection 

3.1.1 Study Setup 

The mock interviews were conducted in a room 
equipped with a desk, two chairs, and two wall-mounted 
cameras, as shown in Figure 2. The two cameras with 
microphones were used to capture the facial expressions 
and the audio conversations during the interview. 

3.1.2 Participants 

Initially, 90 MIT juniors participated in the mock interviews. 
All participants were native English speakers. 
The interviews were conducted by two professional MIT 
career counselors who had over five years of experience. 
For each participant, two rounds of mock interviews 
were conducted: before and after interview intervention. 
For the details of interview intervention, please 
see [13]. Each individual received $50 for participating. 
Furthermore, as an incentive for the participants, we 
promised to forward the resumes of the top 5% of candidates 
to several sponsor organizations (Deloitte, IDEO, 
and Intuit) for consideration for summer internships. We 
chose sponsor organizations which are not directly tied 
to any specific major. After the data collection, 69 (26 
male, 43 female) of the 90 initial participants permitted 
the use of their video recordings for research purposes. 

3.1.3 Procedure 

During each interview session, the counselor asked interviewees 
five different questions, which were recommended 
by the MIT Career Services. These five questions 
were presented in the following order by the counselors 
to the participants: 

Q1. So please tell me about yourself. 

Q2. Tell me about a time when you demonstrated 
leadership. 

Q3. Tell me about a time when you were working 
with a team and faced a challenge. How did you 
overcome the problem? 

Q4. What is one of your weaknesses and how do 
you plan to overcome it? 

Q5. Now, why do you think we should hire you? 

TABLE 1 

List of questions asked to Mechanical Turk workers. The first 
two questions relate to interviewee performance; the 
others concern various traits of their behavior. 

Traits               Description 
Overall Rating       The overall performance rating. 
Recommend Hiring     How likely is he to be hired? 
Engagement           Did he use an engaging tone? 
Excitement           Did he seem excited? 
Eye Contact          Did he maintain proper eye contact? 
Smile                Did he smile appropriately? 
Friendliness         Did he seem friendly? 
Speaking Rate        Did he maintain a good speaking rate? 
No Fillers           Did he use too many filler words? (1 = too many, 7 = no filler words) 
Paused               Did he pause appropriately? 
Authentic            Did he seem authentic? 
Calm                 Did he appear calm? 
Focused              Did he seem focused? 
Structured Answers   Were his answers structured? 
Not Stressed         Was he stressed? (1 = too stressed, 7 = not stressed) 
Not Awkward          Did he seem awkward? (1 = too awkward, 7 = not awkward) 

No job description was given to the interviewees. 
The five questions were chosen to assess the interviewee's 
behavioral and social skills. The interviewers 
rated the performances of the interviewees by answering 
16 assessment questions on a seven point Likert scale. 
We list these questions in Table 1. These questions to 
the interviewers were selected to evaluate the overall 
performance and behavioral traits of the interviewees. 
The first two questions - "Overall Rating" and "Recommend 
Hiring" - represent the overall performance. 
The remaining questions have been selected to evaluate 
several high-level behavioral dimensions such as 
warmth (e.g., "friendliness", "smiling"), presence (e.g., 
"engagement", "excitement", "focused"), competence 
(e.g., "speaking rate"), and content (e.g., "structured"). 

3.2 Data Labeling 

The subjective nature of human judgment makes it 
difficult to collect ground truth for interview ratings. 
Due to the nature of the experiment, the counselors 
interacted with each interviewee twice - before and after 
intervention, and provided feedback after each session. 
The process of feedback and the way the interviewees 
responded to the feedback may have had an influence 
on the counselors' ratings. In order to remove the bias 
introduced by the interaction, we used Amazon Mechanical 
Turk workers to rate the interview performance. 
The Mechanical Turk workers used the same questionnaire 
to assess the ratings as listed in Table 1. Apart from 
being less affected by bias, the Mechanical Turk workers 
could pause and replay the video, allowing them to rate 
more thoroughly. The Turkers' ratings are more likely to 
be similar to "audience" ratings, as opposed to "expert" 
ratings. 

In order to collect ground truth ratings for interviewee 
performances, we first selected 10 Turkers out of 25, 
based on how well they agreed with the career counselors 
on the five control videos. Out of these 10 selected 
Turkers, one did not finish all the rating tasks, leaving 
us with 9 ratings per video. We automatically estimated 
the quality of individual workers using an EM-style optimization 
algorithm, and estimated a weighted average 
of their scores as the ground truth ratings, which were 
used in our prediction framework. 
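
To make the idea of a reliability-weighted average concrete, the following is a minimal sketch of one generic EM-style scheme. It is not the formulation used in this work (that is the subject of Section 4.2); the function name weighted_consensus, the smoothing constant prior_var, and the toy ratings are all assumptions made for illustration. The sketch alternates between computing a weighted consensus per video and re-weighting each rater by the inverse of their smoothed mean squared deviation from that consensus.

```python
from statistics import fmean


def weighted_consensus(ratings, n_iter=25, prior_var=0.5):
    """Estimate consensus scores and rater weights.

    ratings[v][k] is the score rater k gave to video v.
    prior_var is a hypothetical smoothing constant that keeps a single
    rater from absorbing all of the weight.
    """
    n_raters = len(ratings[0])
    weights = [1.0 / n_raters] * n_raters  # start from a plain average

    def consensus_for(w):
        # Weighted mean of each video's ratings (weights sum to 1).
        return [sum(wk * r for wk, r in zip(w, row)) for row in ratings]

    for _ in range(n_iter):
        consensus = consensus_for(weights)
        # A rater's precision is the inverse of their smoothed mean
        # squared deviation from the current consensus.
        precision = []
        for k in range(n_raters):
            mse = fmean([(row[k] - c) ** 2 for row, c in zip(ratings, consensus)])
            precision.append(1.0 / (mse + prior_var))
        total = sum(precision)
        weights = [p / total for p in precision]

    return consensus_for(weights), weights


# Toy data (invented): four videos, three raters; the third rater
# disagrees with the other two most often.
toy_ratings = [[5, 4, 2], [3, 3, 6], [6, 5, 4], [2, 3, 5]]
scores, rater_weights = weighted_consensus(toy_ratings)
```

On the toy data the third rater ends up with the smallest weight, so the consensus scores lean toward the two raters who agree with each other. Without the smoothing term, a scheme of this kind can collapse onto a single rater, which is why a prior is assumed here.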

One of our objectives was to model the temporal relationships 
among the individual interview questions and 
the overall ratings. To accomplish this, we performed a 
second phase of labeling. We hired a different set of 5 
Turkers to rate the performance of an interviewee 
on each of the five interview questions separately. This 
was done by splitting each interview video into five 
different segments, where each segment corresponds to 
one of the interview questions. The video segments were 
shuffled so that each Turker would rate the segments in 
a random order. These per-question ratings were used 
only to analyze the temporal variation in the ratings 
and measure how the temporal order of the questions 
correlates with the ratings for the entire interview. 

4 Prediction Framework 

For the prediction framework, we automatically extracted 
various features from the videos of the interviews. 
Then we trained two regression algorithms: SVM 
and LASSO. The objective of this training is twofold: 
first, to predict the Turkers' ratings on the overall performance 
and each behavioral trait, and second, to quantify 
and gain meaningful insights into the relative importance 
of each modality and the interplay among them. 
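
As an illustration of why LASSO weights lend themselves to interpretation, the sketch below implements a textbook coordinate-descent LASSO solver in plain Python. It is a generic solver, not the implementation used in our experiments, and it assumes centered targets and no intercept; the names lasso_cd and soft_threshold, the penalty lam, and the toy data are illustrative assumptions. The L1 penalty drives the weight of an uninformative feature to exactly zero, which is what makes the surviving weights readable as relative importance.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent.

    X is a list of feature rows, y a list of (centered) targets.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho, z = 0.0, 0.0
            for xi, yi in zip(X, y):
                # Prediction with feature j's contribution left out.
                pred_wo_j = sum(w[k] * xi[k] for k in range(d) if k != j)
                rho += xi[j] * (yi - pred_wo_j)
                z += xi[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z else 0.0
    return w


# Toy data (invented): the target depends only on the first feature.
X_toy = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
y_toy = [-2.0, 0.0, 2.0]
w_toy = lasso_cd(X_toy, y_toy, lam=0.1)
```

In the toy example the second feature receives a weight of exactly zero, while the first is shrunk slightly below its true value of 2, the usual price of the L1 penalty.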

4.1 Feature Extraction 

We collected three types of features for each interview 
video: (1) prosodic features, (2) lexical features, and 
(3) facial features. We selected these features to reflect 
the behaviors that have been shown to be relevant in 
job interviews (e.g., smile, intonation, language content, 
etc.) [9], and also based on the past literature on 
automated social behavior recognition [24], [20], [21], 
[18]. For extracting reliable lexical features, we chose 
not to use automated speech recognition. Instead, we 
transcribed the videos by hiring Amazon Mechanical 
Turk workers, who were specifically instructed to include 
filler and disfluency words (e.g., "uh", "umm", 
"like") in the transcriptions. Our lexical features were 
extracted from these transcripts. We also collected a wide 
range of prosodic and facial features. 


TABLE 2 

List of prosodic features and their brief descriptions. 

Prosodic Feature    Description 
Energy              Mean spectral energy. 
F0 MEAN             Mean F0 frequency. 
F0 MIN              Minimum F0 frequency. 
F0 MAX              Maximum F0 frequency. 
F0 Range            Difference between F0 MAX and F0 MIN. 
F0 SD               Standard deviation of F0. 
Intensity MEAN      Mean vocal intensity. 
Intensity MIN       Minimum vocal intensity. 
Intensity MAX       Maximum vocal intensity. 
Intensity Range     Difference between max and min intensity. 
Intensity SD        Standard deviation of vocal intensity. 
F1, F2, F3 MEAN     Mean frequencies of the first 3 formants: F1, F2, and F3. 
F1, F2, F3 SD       Standard deviation of F1, F2, F3. 
F1, F2, F3 BW       Average bandwidth of F1, F2, F3. 
F2/F1 MEAN          Mean ratio of F2 and F1. 
F3/F1 MEAN          Mean ratio of F3 and F1. 
F2/F1 SD            Standard deviation of F2/F1. 
F3/F1 SD            Standard deviation of F3/F1. 
Jitter              Irregularities in F0 frequency. 
Shimmer             Irregularities in intensity. 
Duration            Total interview duration. 
% Unvoiced          Percentage of unvoiced region. 
% Breaks            Average percentage of breaks. 
maxDurPause         Duration of the longest pause. 
avgDurPause         Average pause duration. 


4.1.1 Prosodic Features 

Prosody reflects our speaking style, particularly the 
rhythm and the intonation of speech. Prosodic features 
have been shown to be effective for social intent modeling 
[8], [17], [18]. To distinguish between the speech of 
the interviewer and the interviewee, we manually annotated 
the beginning and end of each of the interviewee's 
answers. We extracted and analyzed prosodic features 
of the interviewee's speech. Each prosodic feature is 
first collected over an interval corresponding to a single 
answer by the interviewee, and then averaged over all 
her/his five answers. We used the open-source speech 
analysis tool PRAAT [28] for prosody analysis. 

The important prosodic features include pitch information, 
vocal intensities, characteristics of the first three 
formants, and spectral energy, which have been reported 
to reflect our social traits [17]. To reflect the vocal 
pitch, we extracted the mean and standard deviation of 
the fundamental frequency F0 (F0 MEAN and F0 SD), the 
minimum and maximum values (F0 MIN, F0 MAX), and 
the total range (F0 MAX - F0 MIN). We extracted similar 
features for voice intensity and the first 3 formants. Additionally, 
we collected several other prosodic features 
such as pause duration, percentage of unvoiced frames, 
jitter (irregularities in pitch), shimmer (irregularities in 
vocal intensity), percentage of breaks in speech, etc. 
Table 2 shows the complete list of prosodic features. 
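
The two-stage aggregation described above (statistics per answer, then an average over the answers) can be sketched as follows. The sketch assumes the per-frame F0 values of each answer have already been extracted, for example with PRAAT, and are available as plain lists of numbers in Hz. The function names, the use of the population standard deviation, and the toy values are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def pitch_stats(f0):
    """Pitch statistics for one answer; f0 holds voiced-frame F0 values (Hz)."""
    return {
        "F0_MEAN": fmean(f0),
        "F0_SD": pstdev(f0),  # population SD is an assumption
        "F0_MIN": min(f0),
        "F0_MAX": max(f0),
        "F0_RANGE": max(f0) - min(f0),
    }


def interview_pitch_features(answers):
    """Average each per-answer statistic over all answers of one interviewee."""
    per_answer = [pitch_stats(f0) for f0 in answers]
    return {
        name: fmean([stats[name] for stats in per_answer])
        for name in per_answer[0]
    }


# Toy input (invented): two short answers instead of the five in the study.
toy_answers = [[100.0, 120.0, 140.0], [200.0, 220.0]]
toy_features = interview_pitch_features(toy_answers)
```

The same pattern would apply to intensity and formant tracks. Averaging per-answer statistics, rather than pooling all frames, keeps interviewer speech and the silences between answers out of the computation.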
TABLE 3 

LIWC lexical features used in our system. 

LIWC Category    Examples 
I                I, I'm, I've, I'll, I'd, etc. 
We               we, we'll, we're, us, our, etc. 
They             they, they're, they'll, them, etc. 
Non-fluencies    words introducing non-fluency in speech, e.g., uh, umm, well. 
PosEmotion       words expressing positive emotions, e.g., hope, improve, kind, love. 
NegEmotion       words expressing negative emotions, e.g., bad, fool, hate, lose. 
Anxiety          nervous, obsessed, panic, shy, etc. 
Anger            agitate, bother, confront, disgust, etc. 
Sadness          fail, grief, hurt, inferior, etc. 
Cognitive        cause, know, learn, make, notice, etc. 
Inhibition       refrain, prohibit, prevent, stop, etc. 
Perceptual       observe, experience, view, watch, etc. 
Relativity       first, huge, new, etc. 
Work             project, study, thesis, university, etc. 
Swear            informal and swear words. 
Articles         a, an, the, etc. 
Verbs            common English verbs. 
Adverbs          common English adverbs. 
Prepositions     common prepositions. 
Conjunctions     common conjunctions. 
Negations        no, never, none, cannot, don't, etc. 
Quantifiers      all, best, bunch, few, ton, unique, etc. 
Numbers          words related to number, e.g., first, second, hundred, etc. 


4.1.2 Lexical Features 

Lexical features can provide valuable information regarding 
the interview content and the interviewee's personality. 
One of the most commonly used lexical features 
is the unigram counts for each individual word. However, 
treating unigram counts as features often results 
in sparse high-dimensional feature vectors, and suffers 
from the "curse of dimensionality" problem, especially 
for a corpus of limited size. 

We address this challenge with two techniques. First, 
instead of using raw unigram counts, we employed 
counts of various psycholinguistic word categories defined 
by the tool "Linguistic Inquiry and Word Count" 
(LIWC) [29]. The LIWC categories include words describing 
negative emotions (sad, angry, etc.), positive 
emotions (happy, kind, etc.), different function word categories 
(articles, quantifiers, pronouns, etc.), and various 
content categories (e.g., anxiety, insight). We selected 
23 such LIWC word categories, which is significantly 
smaller than the number of individual words. The LIWC 
categories correlate with various psychological traits, 
and often provide indications about our personality and 
social skills [19]. Many of these categories are intuitively 
related to interview performance. Table 3 shows the complete 
list of the LIWC features used in our experiments. 
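
The category-count idea can be sketched with a toy lexicon. The LIWC dictionary itself is not reproduced here: the three categories below contain only the example words listed in Table 3, and the tokenizer, the name category_counts, and the sample sentence are assumptions made for illustration.

```python
import re

# Toy stand-in for the LIWC dictionary, built only from the example
# words given in Table 3; the real lexicon is far larger.
TOY_LEXICON = {
    "I": {"i", "i'm", "i've", "i'll", "i'd"},
    "We": {"we", "we'll", "we're", "us", "our"},
    "Negations": {"no", "never", "none", "cannot", "don't"},
}


def category_counts(transcript, lexicon=TOY_LEXICON):
    """Count how many words of the transcript fall into each category."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return {
        category: sum(word in vocab for word in words)
        for category, vocab in lexicon.items()
    }


toy_counts = category_counts(
    "We built the robot and I'm proud of our team. I never gave up."
)
```

Each transcript is thus reduced from thousands of possible unigram dimensions to one count per category, which is the dimensionality reduction the paragraph above relies on.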

Although the hand-coded LIWC lexicon has proven 
to be useful for modeling many different social behaviors, 
the lexicon is predefined and may not cover 
many important aspects of job interviews. To address 
this challenge, we aimed to automatically learn a lexicon 
from the interview dataset. We apply the Latent Dirichlet 
Allocation (LDA) [30] method to automatically learn 
common topics from our interview dataset. We set the 
number of topics to 20. For each interview, we estimate 
the relative weights of these learned topics, and use 
these weights as lexical features. Similar ideas have been 
exploited by Ranganath et al. [20], [21] for modeling 
social traits in a speed-dating dataset, but they used deep 
auto-encoders [31] instead of LDA. 

Finally, we collected additional lexical features that 
correlate with job interview ratings. These are features 
related to our linguistic and speaking skills. Table 4 
contains the full list. Similar speaking rate and fluency 
features were exploited by Zechner et al. [18] in the 
context of automated scoring of non-native speech in 
TOEFL practice tests. 

TABLE 4 

Additional features related to speaking rate and fluency. 

Feature Name    Description 
wpsec           Words per second. 
upsec           Unique words per second. 
fpsec           Filler words per second. 
wc              Total number of words. 
uc              Total number of unique words. 
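
The speaking rate and fluency features of Table 4 follow directly from a transcript and the duration of the speech. The sketch below is one possible reading of those definitions; the filler list (taken from the examples given in Section 4.1), the tokenizer, and the function name fluency_features are assumptions, not the exact procedure used in our experiments.

```python
import re

# Assumed filler list, based on the examples "uh", "umm", "like".
# Note that "like" is also an ordinary verb, so this over-counts.
FILLERS = {"uh", "um", "umm", "like"}


def fluency_features(transcript, duration_sec):
    """Compute the Table 4 features for one transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    wc = len(words)       # total number of words
    uc = len(set(words))  # total number of unique words
    n_fillers = sum(word in FILLERS for word in words)
    return {
        "wc": wc,
        "uc": uc,
        "wpsec": wc / duration_sec,
        "upsec": uc / duration_sec,
        "fpsec": n_fillers / duration_sec,
    }


toy_fluency = fluency_features(
    "Umm, I led the team. Uh, the team did well.", duration_sec=5.0
)
```

For the toy sentence this gives 10 words, 8 unique words, and 2 fillers over 5 seconds, i.e., wpsec = 2.0, upsec = 1.6, and fpsec = 0.4.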

4.1.3 Facial Features 

We extracted facial features for the interviewees from 
each frame in the video. First, faces were detected using 
the Shore [32] framework. We trained a classifier 
to distinguish between neutral and smiling faces. The 
classifier is trained using the AdaBoost algorithm. The 
classifier output is normalized to the range [0,100], where 
0 represents no smile, and 100 represents full smile. Finally, 
we averaged the smile intensities from individual 
frames, and used this as a feature in our model. We 
also extracted head gestures such as nods and shakes 
as explained in [13]. 

In addition to the smile intensity and head gestures 
(nod and shake), we also extracted a number of 
other facial features using a Constrained Local Model 
(CLM) [33] based face tracker¹, as illustrated in Figure 3. 
The face tracker detects 66 interest points on a face 
image. It works by fitting the following parametric shape 
model [33], [34]: 

$$x_i = sR(\bar{x}_i + \Phi_i q) + t, \qquad (1)$$

where $x_i$ is the coordinate of the $i$-th interest point, 
$\bar{x}_i$ denotes its mean location pre-trained from a large 
collection of hand-labeled training images, and $\Phi_i$ denotes 
the bases of local variations for the interest point. 
Each element of the vector $q$ represents a coefficient corresponding 
to a basis of local variation. The parameters 
$s$, $R$, and $t$ correspond to the global transformations 
associated with scaling, rotation, and translation, respectively. 
The face tracker adjusts the model parameters 
$p = \{s, R, q, t\}$ so that each of the mean interest points 
($\bar{x}_i$) fits best to its corresponding point ($x_i$) on the test 
face. 

1. https://github.com/kylemcdonald/FaceTracker 

Fig. 3. Illustration of facial features: OBH (outer eye-brow 
height), IBH (inner eye-brow height), OLH (outer lip 
height), ILH (inner lip height), eye opening, and LipCDT 
(lip corner distance). 

While extracting features from these tracked interest
points, we want to disregard the global transformations
(translation, rotation, and scaling), and consider only the
local transformations, which provide useful information
regarding facial expressions. After the face tracker
converges to an optimal estimate of the parameters, we
recalculate each of the interest points $x_i$ by applying
the local transformations only, while disregarding the
global transformations ($s$, $R$, and $t$). Mathematically, we
calculate the following shape model from the optimal
parameters obtained from the face tracker:


\[ x_i = \bar{x}_i + \Phi_i q. \tag{2} \]


Once we find $x_i$, we calculate the distances between
the corresponding interest points to obtain the features
OBH (outer eye-brow height), IBH (inner eye-brow height),
OLH (outer lip height), ILH (inner lip height), eye
opening, and LipCDT (lip corner distance), as illustrated
in Figure 3. By disregarding the global transformation
parameters, the extracted facial features are invariant to
global translations, rotations, and scaling variations. In
addition to the features shown in Figure 3, we separately
incorporated three head pose features (Pitch, Yaw, and
Roll), based on the corresponding elements of the rotation
matrix $R$.
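
As a rough illustration of Eq. (2) and the distance features, the following sketch rebuilds pose-free landmark positions from the mean shape and the local-variation coefficients, and then measures the distance between two landmarks. The helper names, the toy two-point shape, and the landmark indices are hypothetical; the tracker described above uses 66 points and its own pre-trained bases.

```python
import math

def local_shape(mean_points, bases, q):
    """Apply only the local deformation of Eq. (2): mean location + Phi_i q.

    mean_points: list of (x, y) mean landmark locations.
    bases: bases[i] holds two rows (x and y) of length len(q) for landmark i.
    q: list of local-variation coefficients.
    """
    shape = []
    for (mx, my), phi in zip(mean_points, bases):
        dx = sum(p * c for p, c in zip(phi[0], q))
        dy = sum(p * c for p, c in zip(phi[1], q))
        shape.append((mx + dx, my + dy))
    return shape

def landmark_distance(shape, i, j):
    """Euclidean distance between two landmarks of the pose-free shape."""
    return math.dist(shape[i], shape[j])

# Toy example (assumed values): two landmarks and one deformation mode
# that moves the second landmark vertically.
mean_points = [(0.0, 0.0), (0.0, 1.0)]
bases = [[[0.0], [0.0]], [[0.0], [1.0]]]
shape = local_shape(mean_points, bases, q=[0.5])
height = landmark_distance(shape, 0, 1)
```

Because the scale, rotation, and translation never enter the computation, the resulting distance does not change with head pose or camera distance, which is the invariance argued for above.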


4.1.4 Feature Normalization 

We concatenate the three types of features described
above and obtain one combined feature vector. To remove
any possible bias related to the range of values
associated with a feature, we normalized each feature to
have zero mean and unit variance, which allows treating
all the features uniformly.
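
A minimal sketch of this normalization is given below. The function name is hypothetical, and the use of the population standard deviation and the handling of constant features are our assumptions rather than details stated above.

```python
import statistics

def zscore_columns(rows):
    """Standardize each column of a list of feature vectors to zero mean
    and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation (an assumption of this sketch).
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # Constant features carry no information and are mapped to zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = zscore_columns(features)
```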


4.2 Ground Truth Ratings and Turker Quality Estimation

We aim to automatically estimate the reliability of each
Turker and the ground truth ratings based on the Turkers'
ratings. We adapt a simplified version of the existing
latent variable models [35] that treat each Turker's
reliability and the ground truth ratings as latent
parameters, and estimate their values using an EM-style
iterative optimization technique.

Let us assume an input training dataset $\mathcal{D} =
\{x_i, y_i\}_{i=1}^{N}$ containing $N$ feature vectors $x_i$ (one for
each interview video), for which the ground truth label
$y_i$ is not known. Instead, we acquire subjective labels
$\{y_i^1, \ldots, y_i^K\}$ from $K$ Turkers on a seven-point Likert
scale, i.e., $y_i^j \in \{1, 2, \ldots, 7\}$. Given this dataset
$\mathcal{D}$, our goal is to learn the true rating ($y_i$) and the
reliability of each worker ($\lambda_j$).

To simplify the estimation problem, we assume the
Turkers' ratings to be real numbers, i.e., $y_i^j \in \mathbb{R}$. We
also assume that each Turker's rating is a noisy version
of the true rating $y_i \in \mathbb{R}$, perturbed via additive
Gaussian noise. Therefore, the probability distribution
for $y_i^j$ is:

\[ \Pr[y_i^j \mid y_i, \lambda_j] = \mathcal{N}(y_i^j \mid y_i, 1/\lambda_j), \tag{3} \]


where $\lambda_j$ is the unknown inverse variance and the
measure of reliability for the $j$-th Turker. By taking the
logarithm on both sides and ignoring constant terms, we
get the log-likelihood function:


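\[ \mathcal{L} = \sum_{i=1}^{N} \sum_{j=1}^{K} \left[ \frac{1}{2} \log \lambda_j - \frac{\lambda_j}{2} \left( y_i^j - y_i \right)^2 \right]. \tag{4} \]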


The log-likelihood function is non-convex in the $y_i$ and
$\lambda_j$ variables jointly. However, if we fix $y_i$,
maximizing it becomes a convex problem with respect to
$\lambda_j$, and vice versa. Assuming $\lambda_j$ fixed, and
setting $\partial \mathcal{L} / \partial y_i = 0$, we obtain
the update rule:


\[ y_i = \frac{\sum_{j=1}^{K} \lambda_j \, y_i^j}{\sum_{j=1}^{K} \lambda_j}. \tag{5} \]


Similarly, assuming $y_i$ fixed, and setting
$\partial \mathcal{L} / \partial \lambda_j = 0$, we
obtain the update rule:


\[ \lambda_j = \frac{N}{\sum_{i=1}^{N} \left( y_i^j - y_i \right)^2}. \tag{6} \]


We alternately apply the two update rules for $y_i$ and
$\lambda_j$, for $i = 1, \ldots, N$ and $j = 1, \ldots, K$,
until convergence. After convergence, the estimated $y_i$
values are treated as ground truth ratings and used for
training our prediction models.


















4.3 Score Prediction from Extracted Features 

Using the features described in the previous section, we
train regression models to predict the interview scores.
We also train models to predict other interview-specific
traits such as excitement, friendliness, engagement, and
awkwardness. We experimented with many different
regression models, including Support Vector Machine
Regression (SVR) [36], Lasso [37], L1 Regularized Logistic
Regression, and Gaussian Process Regression. We will only
discuss SVR and Lasso, which achieved the best results
with our dataset.

4.3.1 Support Vector Regression (SVR) 

The Support Vector Machine (SVM) is a widely used
supervised learning method. In this paper, we focus
on SVMs for regression, in order to predict the
performance ratings from interview features. Suppose
we are given training data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$,
where $x_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector for
the $i$-th interview in the training set. For each feature
vector $x_i$, we have an associated value $y_i \in \mathbb{R}_+$
denoting the interview rating. Our goal is to learn the
optimal weight vector $w \in \mathbb{R}^d$ and a scalar bias term
$b \in \mathbb{R}$, such that the predicted value for the feature
vector $x$ is $\hat{y} = w^T x + b$.
We minimize the following objective function:

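\[
\begin{aligned}
\underset{w, \xi_i, \xi_i^*, b}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & y_i - w^T x_i - b \le \epsilon + \xi_i, \quad \forall i \\
& w^T x_i + b - y_i \le \epsilon + \xi_i^*, \quad \forall i \\
& \xi_i, \xi_i^* \ge 0, \quad \forall i
\end{aligned}
\tag{7}
\]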

The $\epsilon > 0$ is the precision parameter specifying the
amount of deviation from the true value that is allowed,
and $\xi_i, \xi_i^*$ are the slack variables that allow
deviations larger than $\epsilon$. The tunable parameter
$C > 0$ controls the tradeoff between goodness of fit and
generalization to new data. The convex optimization problem
is often solved by maximizing the corresponding dual problem.
In order to analyze the relative weights of different
features, we transform it back to the primal problem
and obtain the optimal weight vector $w^*$ and bias term
$b^*$. The relative importance of the $j$-th feature can be
interpreted by the associated weight magnitude $|w_j^*|$.
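
To make the objective in Eq. (7) concrete, the sketch below evaluates it for a given weight vector and bias. It does not solve the optimization problem; it only uses the fact that, for fixed $w$ and $b$, the smallest feasible slack values are the amounts by which each residual exceeds $\epsilon$. The function name and the default values of `C` and `epsilon` are hypothetical.

```python
def svr_objective(w, b, X, y, C=1.0, epsilon=0.1):
    """Value of the primal SVR objective for a linear predictor w^T x + b."""
    total_slack = 0.0
    for features, target in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, features)) + b
        # Under-prediction beyond the epsilon tube (xi_i).
        xi = max(0.0, target - prediction - epsilon)
        # Over-prediction beyond the epsilon tube (xi_i^*).
        xi_star = max(0.0, prediction - target - epsilon)
        total_slack += xi + xi_star
    return 0.5 * sum(wj * wj for wj in w) + C * total_slack

# Toy data: the first point lies inside the tube, the second is 0.5 away.
value = svr_objective(w=[1.0], b=0.0, X=[[1.0], [2.0]], y=[1.0, 2.5])
```

Residuals smaller than `epsilon` contribute nothing, so only the second toy point adds to the penalty term.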

4.3.2 Lasso 

The Lasso regression method aims to minimize the residual
prediction error in the presence of an L1 regularization
function. Using the same notation as the previous
section, let the training data be $\{(x_1, y_1), \ldots, (x_N, y_N)\}$.
Let our linear predictor be of the form $\hat{y} = w^T x + b$.
The Lasso method estimates the optimal $w$ and $b$ by
minimizing the following objective function:

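\[
\begin{aligned}
\underset{w, b}{\text{minimize}} \quad & \sum_{i=1}^{N} \left( y_i - w^T x_i - b \right)^2 \\
\text{subject to} \quad & \|w\|_1 \le \lambda
\end{aligned}
\tag{8}
\]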


Fig. 4. The inter-rater agreement among the Turkers,
measured by Krippendorff's Alpha (varies in the range
[-1,1]) and the average one-vs-rest correlation of their
ratings (range [-1,1]).

where $\lambda > 0$ is the regularization constant, and
$\|w\|_1 = \sum_{j=1}^{d} |w_j|$ is the L1 norm of $w$. The L1
regularization is known to push the coefficients of the
irrelevant features down to zero, thus reducing the
predictor variance. We control the amount of sparsity in
the weight vector $w$ by tuning the regularization
constant $\lambda$.
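
The sparsity effect can be illustrated with a small solver. The sketch below uses the penalized form, minimizing half the squared error plus `alpha` times the L1 norm, which corresponds to the constrained form above for a suitable pairing of `alpha` and the constraint level. Coordinate descent is a commonly used solver that we substitute here for illustration; the text does not state which solver was used, and all names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning exactly 0 inside it."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize 0.5 * sum (y_i - w^T x_i - b)^2 + alpha * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = sum(y) / n
    for _ in range(n_iter):
        for j in range(d):
            rho, norm = 0.0, 0.0
            for xi, yi in zip(X, y):
                # Prediction with feature j's contribution left out.
                partial = b + sum(w[k] * xi[k] for k in range(d) if k != j)
                rho += xi[j] * (yi - partial)
                norm += xi[j] ** 2
            w[j] = soft_threshold(rho, alpha) / norm if norm > 0 else 0.0
        # The bias is not penalized: set it to the mean residual.
        b = sum(yi - sum(wk * xk for wk, xk in zip(w, xi))
                for xi, yi in zip(X, y)) / n
    return w, b

# Toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
y = [2.1, 8.1, 1.9, 7.9]
w, b = lasso_coordinate_descent(X, y, alpha=1.0)
```

In this toy example the weak feature's coefficient is set exactly to zero while the strong feature's coefficient is only shrunk, which is the behavior described above.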

5 Results 

We organize our results in two sections. First, we analyze
the ratings by Mechanical Turk workers (Section 5.1).
The quality and reliability of Turkers' ratings are
assessed by observing how well the Turkers agree with
each other (Section 5.1.1). In addition, we identify which
traits are important to succeed in job interviews by
measuring the correlations of the ratings for individual
traits with the overall ratings (Section 5.1.2).
Furthermore, we examine the correlations between the
ratings for individual video segments and those for the
entire videos. This allows us to evaluate the temporal
patterns in job interviews (Section 5.1.3).

In Section 5.2, we present the prediction accuracies for 
the trained regression models (SVR and Lasso) based 
on automatically extracted features, and analyze the 
relative influence of different modalities and features on 
prediction accuracy. 

5.1 Analysis of Mechanical Turk Dataset 

5.1.1 Inter-Rater Agreement 

Fig. 5. Correlation and mutual information (in bits) between
the overall rating and ratings on other traits.

To assess the quality of the ratings, we calculate
Krippendorff's Alpha [38] for each trait. In this case,
Krippendorff's Alpha is more meaningful than the frequently
used Fleiss' Kappa [39], as the ratings are ordinal values
(on a 7-point Likert scale). The value of Krippendorff's
Alpha can be any real number in the range [-1,1], with
1 being perfect agreement and -1 being absolute
disagreement among the raters. We also estimate the
correlation of each Turker's rating with the mean rating
by the other Turkers for each trait. Figure 4 shows that
some traits have relatively good inter-rater agreement
among the Turkers (e.g., "engagement", "excitement",
"friendliness"). Some other traits, such as "stress",
"authenticity", "speaking rate", and "pauses", have low
inter-rater agreement. This may be because the Turkers
were not in a position to judge those categories with the
video data only.

5.1.2 Correlation among the Behavioral Traits 

We are interested in identifying the traits that correlate 
highly with overall ratings. This knowledge can help 
interviewees understand the most important behavioral 
traits in job interviews. We plot the mutual information 
and correlation between various ratings given by the 
Mechanical Turk workers and the overall rating of the 
interviewee performance in Figure 5. 
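
For reference, mutual information in bits between two discrete rating sequences can be estimated with a simple plug-in estimator, as sketched below. The estimator, the function name, and the assumption that ratings are discrete (or have been binned beforehand) are ours; the text does not specify how the mutual information was computed.

```python
import math
from collections import Counter

def mutual_information_bits(a, b):
    """Plug-in estimate of the mutual information (in bits) between two
    equal-length sequences of discrete values."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts.
        ratio = count * n / (count_a[x] * count_b[y])
        mi += p_xy * math.log2(ratio)
    return mi

# Toy ratings: the trait rating fully determines the overall rating.
dependent = mutual_information_bits([1, 1, 7, 7], [2, 2, 6, 6])
# Toy ratings: the two ratings are unrelated.
independent = mutual_information_bits([1, 1, 7, 7], [1, 7, 1, 7])
```

Unlike the correlation coefficient, this quantity also captures non-linear dependence between a trait rating and the overall rating.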

The first bar in Figure 5 represents whether the rater
would recommend hiring the interviewee. It is another
form of the overall rating and shows high correlation
and mutual information with the overall rating. It is
evident from the plot that the most important trait in
an interview is to stay focused. This trait shows a
correlation of 0.73 with the overall rating. Some other
top traits include possessing an engaging tone, not
appearing awkward, being excited, and displaying an
appropriate smile. The mutual information and the
correlation coefficient closely follow the same pattern.
This plot gives us an insight into what constitutes a
good interview.


5.1.3 First (and Last) Impression Matters 
We would like to understand how the performance on
different interview questions during an interview affects
the overall rating. To understand this temporal
relationship, we calculated the correlation and mutual
information between the ratings for each individual
interview question and the ratings for the entire videos.
In Figure 6, we plot this relationship. It is evident from
Figure 6(a) that performance on the first question
correlates most with the overall performance. After the
first question, the correlation gradually decays. We can
interpret this result as follows: if an interviewee
performs well on the first question, it is more likely
that he/she will end up receiving an above-average rating.
The opposite also holds: if an interviewee performs poorly
on the first question, he/she is more likely to receive a
poor overall rating. This finding is also supported by
existing evidence from psychology [40], [5].

A similar first-impression pattern holds for
ratings on various other traits of the interviewee's
behavior, such as whether he/she was excited, smiled,
maintained eye contact, talked in an engaging tone, or
appeared friendly. Figure 6(b) illustrates this. We notice
from this figure that there is a sudden spike in correlation
for the last question. This indicates that, although
the first question matters the most, the interviewee can
significantly change the interviewer's perception during
the response to the final question.

Figure 6(c) shows that some traits (e.g., pause, calmness,
stress) do not follow the pattern discussed above. However,
they have very low correlation values to begin with.
We believe it is difficult for Mechanical Turk workers to
accurately judge these traits, as these judgments demand
considerable concentration.

We need to be cautious while interpreting this result.
Although the ratings for the first question had the maximum
correlation with the overall ratings for the entire
interview, we cannot say whether this is due to the temporal
order or the verbal content of the question itself. However,
we would like to emphasize that our mock interviews
start with a question about the interviewee's background,
which is consistent with many real-world job interviews.

5.2 Prediction using Automated Features 

5.2.1 Prediction Accuracy using Trained Models 
Given the feature vectors associated with each interview
video, we would like to provide feedback to users about
their overall performance in the interview, the likelihood
of getting an offer, and insights into other personality
traits that are relevant for job interviews. We train
regression models for predicting ratings for a total of 16
traits or rating categories (as shown in Table 1).

Fig. 6. Correlation between ratings of different segments
and the rating on the whole interview.

The entire dataset has a total of 138 interview videos
(2 interviews for each of the 69 participants). We used
80% of the videos for training, and the remaining 20% for
testing. To avoid any artifacts related to how we split
the data, we performed 1000 random trials. In each trial,
we randomly selected 80% of the videos for training, and
used the rest for testing. We report our results averaged
over these 1000 independent trials. In each trial, we
trained 16 different regression models, one for each of
the 16 traits. For each of the traits, we used exactly the
same set of features. The model automatically learned
the weights for individual features for each trait.
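
The repeated random-split protocol can be sketched as follows. The `evaluate` callback is a hypothetical placeholder that would train a model on the training indices and return a score on the test indices; the seed and the rounding of the split size are also assumptions of this sketch.

```python
import random
import statistics

def repeated_holdout(n_items, evaluate, n_trials=1000, train_fraction=0.8, seed=0):
    """Average a score over repeated random train/test splits.

    evaluate(train_indices, test_indices) must return a number.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    n_train = int(round(train_fraction * n_items))
    scores = []
    for _ in range(n_trials):
        rng.shuffle(indices)
        # Slicing copies the lists, so later shuffles do not alter them.
        train, test = indices[:n_train], indices[n_train:]
        scores.append(evaluate(train, test))
    return statistics.fmean(scores)

# Dummy callback (assumed): reports the fraction of items held out.
held_out = repeated_holdout(138, lambda train, test: len(test) / 138, n_trials=10)
```

With 138 videos, each trial in this sketch trains on 110 videos and tests on the remaining 28.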

We measure prediction accuracy by the correlation
coefficients between the true ratings and predicted
ratings in the test set. Figure 7 displays the correlation
coefficients for different traits, both with SVM and Lasso.
The traits are shown in the order of their correlation
coefficients. We observe that we can predict several
traits with 0.75 or higher correlation coefficients:
engagement, excitement, and friendliness. Furthermore, we
performed well in predicting overall performance and
hiring recommendation scores (r ≈ 0.70 for SVM), which
are the two most important scores for the interview decision.

We also evaluate the learned regression models for a
two-class classification task. For each trait, we split the
interviews into two groups by the median value for that
trait. Any interview with a score higher than the median
value for a particular trait is considered to be in the
positive class (for that trait), and the rest are placed in
the negative class. We then vary the threshold on the
scores predicted by our regression models in the range
[1,7], and estimate the area under the Receiver Operating
Characteristic (ROC) curve. The baseline area under the
curve (AUC) value is 0.50, as we split the classes by the
median value. The AUC values for the learned models are
presented in Table 5. Again, we observe high accuracies
for engagement, excitement, friendliness, hiring
recommendation, and the overall score (AUC > 0.80 for SVM).
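
A sketch of this evaluation is given below. Instead of sweeping a threshold explicitly, it uses the pairwise-ranking formulation of the AUC (the fraction of positive/negative pairs that the predictions order correctly, counting ties as one half), which equals the area under the empirical ROC curve. The function name is hypothetical, and the sketch assumes both classes are non-empty.

```python
import statistics

def median_split_auc(true_scores, predicted_scores):
    """AUC for separating above-median interviews from the rest."""
    median = statistics.median(true_scores)
    pairs = list(zip(true_scores, predicted_scores))
    positives = [p for t, p in pairs if t > median]
    negatives = [p for t, p in pairs if t <= median]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0      # correctly ordered pair
            elif pos == neg:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Toy example: six interviews with true and predicted ratings.
auc = median_split_auc([1, 2, 3, 4, 5, 6], [1, 5, 2, 3, 6, 4])
```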

When we examine the traits with lower prediction
accuracy, we observe that either (1) we have low inter-rater
agreement for these traits, which indicates unreliable
ground truth data (e.g., calm, stressed, structured
answer, pause), or (2) we lack key features necessary
to predict these traits (e.g., eye contact). In the absence
of eye tracking information (which is very difficult to
obtain automatically), we do not have enough informative
features to predict eye contact.

Fig. 7. Regression coefficients using two different
methods: Support Vector Machine (SVM) and Lasso. Traits
shown, in order: Excited, Engagement, Friendly, Recommend
Hiring, Smiled, Overall, Structured Answers, Not Awkward,
Paused, No Fillers, Focused, Authentic, Speaking Rate,
Eye Contact, Calm, Not Stressed.

5.2.2 Feature Analysis 

The relative weights of individual features in our
regression model can provide valuable insights into the
essential constituents of a job interview. To analyze this,
we examined the features with the highest weights for the
SVM and the Lasso models. We considered five traits with
high accuracy: overall score, recommend hiring, excitement,
engagement, and friendliness. We considered the
top twenty features in the order of descending weight
magnitude, and estimated the summation of the weight
magnitudes of the features in each of the three categories:
prosodic, lexical, and facial features. The relative
proportions of prosodic, lexical, and facial features are
illustrated in Figure 8(a), which shows that both SVM and
Lasso assign higher weights to prosodic features while
predicting engagement and excitement. This indicates
that engagement and excitement are expressed through
prosodic features, which agrees with our intuition. For
both models, the relative weights of features for
predicting the "overall rating" and "recommend hiring" are
similar, which is expected, as these two traits are highly
correlated (Figure 5). Since we had a smaller number of
facial features, the relative weights for facial features
are much lower. However, facial features, particularly the
smile, were found significant for predicting friendliness.
This result supports the claim that a smile is very
important in order to appear friendly.

Fig. 8. Analysis of relative importance of facial, prosodic,
and lexical features, shown for the traits Overall,
Recommend Hiring, Engagement, Excitement, and Friendly.
(a) Relative proportion of the top twenty prosodic, lexical,
and facial (smile) features as learned by the SVM and Lasso
models. The weights semantically match our perceptions of
the traits. (b) Correlation coefficients for SVM, for
different combinations of facial (F), prosodic (P), and
lexical (L) features.

TABLE 5
The average area under the ROC curve.

Trait                 SVM     Lasso
Excited               0.904   0.885
Engagement            0.858   0.850
Smiled                0.845   0.845
Friendly              0.824   0.793
Recommend Hiring      0.815   0.796
Structured Answers    0.812   0.799
Not Awkward           0.808   0.787
Overall               0.805   0.777
No Fillers            0.803   0.855
Focused               0.791   0.677
Paused                0.749   0.749
Authentic             0.688   0.642
Eye Contact           0.676   0.622
Calm                  0.651   0.669
Speaking Rate         0.608   0.546
Not Stressed          0.604   0.572


Figure 8(b) shows the importance of using multimodal
features for predicting social traits in job interviews. In
most cases, the best correlation coefficient was obtained
when we incorporated all three modalities. Although
lexical features were critical for predicting overall ratings
and likelihood of getting hired, they were not strong
predictors of excitement, engagement, and friendliness.
Prosodic features played an important role in predicting
all five traits, indicating that our speaking style plays a
critical role in job interviews.
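
The category-level comparison used in this section (summing the weight magnitudes of the top features per modality) can be sketched as follows. The function name is hypothetical. The example uses the five largest weights listed for the Friendly trait in Table 6, with modality labels assigned according to the feature descriptions in Section 4.1.

```python
def modality_shares(weights, modality_of, top_k=20):
    """Share of total weight magnitude per modality among the top_k features.

    weights: dict mapping feature name to learned weight.
    modality_of: dict mapping feature name to its modality label.
    """
    top = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)[:top_k]
    totals = {}
    for name in top:
        modality = modality_of[name]
        totals[modality] = totals.get(modality, 0.0) + abs(weights[name])
    grand_total = sum(totals.values())
    return {modality: total / grand_total for modality, total in totals.items()}

# Five largest weights for the Friendly trait, taken from Table 6.
friendly = {"smile": 0.239, "mean pitch": 0.156, "f3STD": -0.11,
            "LipCDT": 0.1, "intensityMax": 0.098}
modality = {"smile": "facial", "LipCDT": "facial", "mean pitch": "prosodic",
            "f3STD": "prosodic", "intensityMax": "prosodic"}
shares = modality_shares(friendly, modality, top_k=5)
```

Restricted to these five features, the facial share is close to one half, which reflects the weight of the smile feature noted above; the full analysis uses the top twenty features.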

5.2.3 Recommendation from our Framework 
To better understand the recommended behavior in job
interviews, we analyze the feature weights in our
regression model. The weights with positive signs and
higher magnitudes can potentially indicate elements of
a successful job interview. The negative weights, on the
other hand, indicate behaviors we should avoid.

We sort the features by the magnitude of their weights
and list the top twenty features (excluding the topic
features) in Table 6. We see from this table that people
having a higher speaking rate (higher words per second
(wpsec), total number of words (wc), total number of
unique words (uc), etc.) are perceived as better
candidates in a job interview. People who speak more
fluently and use fewer filler words (lower number of
filler words per second (fpsec), total number of filler
words (Fillers), total number of non-fluency words
(Non-fluencies), less unvoiced region in speech
(%Unvoiced), and fewer breaks in speech (%Breaks)) are
perceived as better candidates. We also find that a higher
interview score correlates with higher usage of words in
the LIWC category They (e.g., they, they'll, them) and
lower usage of words related to I. The overall interview
performance and likelihood of hiring correlate positively
with the proportion of positive words, and negatively with
the proportion of negative words, which agrees with our
experience. Individuals who smiled more performed
better in job interviews. Finally, those speaking with a
higher proportion of quantifiers (e.g., best, every, all,
few), perceptual words (e.g., see, observe, know), and
other functional word classes (articles, prepositions,
conjunctions) obtained higher scores in the interview. As
we saw earlier, features related to prosody and speaking
style are more important to appear excited and engaged.
In particular, the amplitude, variations in the voice
intensity, and the first 3 formants had high positive
weights in our prediction model. Finally, besides smiling,
people who spoke more words related to "We" than "I" were
perceived as friendlier.
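
The reading of the weights used in this section can be sketched as follows: rank features by weight magnitude, then separate them by sign. The function name is hypothetical, and the example uses the six largest weights listed for the Overall trait in Table 6.

```python
def split_by_sign(weights, top_k=20):
    """Return (positive, negative) feature names among the top_k weights,
    each list ordered by descending weight magnitude."""
    top = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)[:top_k]
    positive = [name for name, weight in top if weight > 0]
    negative = [name for name, weight in top if weight < 0]
    return positive, negative

# Six largest weights for the Overall trait, taken from Table 6.
overall = {"avgBand1": -0.111, "wpsec": 0.104, "Fillers": -0.085,
           "Quantifiers": 0.084, "avgDurPause": -0.081, "smile": 0.079}
encouraged, discouraged = split_by_sign(overall)
```

The signs describe associations learned by a linear model from this dataset, so the two lists are better read as correlates of higher and lower ratings than as guaranteed causes.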

6 Discussion and Conclusion 

We present an automated prediction framework for
quantifying social skills for job interviews. The proposed
model shows encouraging results and predicts human
interview ratings with correlation r > 0.65 and AUC
≈ 0.80 (compared to the baseline AUC = 0.50). Several
traits such as engagement, excitement, and friendliness
were predicted with even higher accuracy (r ≈ 0.75,
AUC > 0.85). One of our immediate next steps will be to
integrate the proposed prediction module with existing
automated conversational systems such as MACH to
provide valuable real-time feedback to the users.

To our knowledge, the interview dataset used in our
experiments is the largest collection of job interview
videos, collected under reasonably realistic settings. The
interviews were conducted by professional career
counselors. We included the questions that would be relevant
in most real-world job interviews. Despite efforts to
record interviews in realistic settings, we do need to
acknowledge several caveats and trade-offs.

All the participants in our dataset were MIT
undergraduates, all of junior status, which may introduce a
selection bias in our data. In the future, we plan to conduct
a more comprehensive study over a more general and
diverse population group. We deliberately chose not to
specify a job description to encourage a larger number of
student participants. At the time of the study, there were
nearly 1000 junior students present at MIT, and nearly
30% were international students. Out of the remaining
700 native English speaking juniors, we were able to
recruit 90, which would have been difficult if we had
limited our study to a specific job description. However,
in the absence of a specific job description, the ground
truth ratings may not necessarily correspond to actual
hiring decisions, and may show a stronger bias towards
non-verbal cues, as there are no specific skill requirements.
Furthermore, our mock interviews may lack the stress
present in real job interviews. Although we promised
to forward the resumes of the top 5% of candidates to
several sponsor organizations, the incentive was not as
strong as an actual job offer. In the future, we would like
to conduct more controlled experiments with a specific
job description and with stronger incentives to induce
stress and competition.

We aimed to have each video rated by multiple independent
judges to avoid personal bias. As a first step, we
recruited Turkers, as this was scalable, quick, and less
expensive. To ensure reliable ground truth ratings, each
video was rated by 9 Mechanical Turk workers, and the
ratings were aggregated using the EM algorithm, taking the
reliability of each worker into account. However, Turkers'
ratings may not correspond to those of professional
experts. In the future, we plan to collect ratings from a
panel of experts, and re-validate the results.

Interestingly, while training regression models using
SVR, we obtained better prediction accuracy using the
linear kernel, compared to other non-linear kernels (e.g.,
quadratic, cubic, or Gaussian kernels). This may indicate
that our features do not exhibit complicated non-linear
interactions. However, the features used in the current
models were mostly aggregated features, averaged over
the entire duration of the video (e.g., average pitch,
average smile intensity). It is plausible that our smile
and intonation while uttering a specific word can be
a determinant of the final interview decision. The
current aggregated features are incapable of modeling such
temporal interactions. Modeling fine-grained temporal
features across multiple modalities is left as our future
work.

The outcome of job interviews often depends on a
subtle understanding of the interviewee's response. In
our dataset, we noticed interviews in which a momentary
mistake (e.g., the use of a swear word) ruined the
interview outcome. Due to the rare occurrences of such
events, it is difficult to model these phenomena, and
perhaps anomaly detection techniques could be more
effective instead. Extending our prediction framework
for quantifying these diverse and complex cues in job
interviews can provide valuable insight and understanding
regarding job interviews and human behavior in
general.

Caveats aside, the results presented in this article show
the importance of including multiple modalities while
analyzing our social interactions. The analysis of the
feature weights learned by our prediction models provides
quantitative insights into the determinants of successful
job interviews. With the knowledge presented in this
article, we could train a system to help underprivileged
youth receive feedback on job interviews that require a
significant amount of social skills. The framework could
also be expanded to help people with social difficulties,
train customer service professionals, or even help
medical professionals with telemedicine.
TABLE 6
Feature Analysis using the SVM model. We are listing the top twenty features ordered by their weight magnitude. We
have excluded the topic features for the ease of interpretation.

Overall                 | RecommendHiring         | Excited                 | EngagingTone            | Friendly
avgBand1         -0.111 | wpsec             0.138 | avgBand1         -0.152 | intensityMax      0.175 | smile             0.239
wpsec             0.104 | avgBand1         -0.134 | diffIntMaxMin     0.134 | avgBand1         -0.168 | mean pitch        0.156
Fillers          -0.085 | Fillers          -0.126 | wpsec              0.13 | diffIntMaxMin     0.155 | f3STD             -0.11
Quantifiers       0.084 | percentUnvoiced  -0.116 | intensityMax      0.125 | intensityMean     0.144 | LipCDT              0.1
avgDurPause      -0.081 | smile             0.101 | nod               0.121 | wpsec             0.132 | intensityMax      0.098
smile             0.079 | PercentBreaks    -0.094 | mean pitch        0.118 | avgBand2         -0.112 | diffIntMaxMin     0.095
upsec             0.078 | upsec             0.088 | smile             0.117 | f1STD            -0.107 | intensityMean     0.087
percentUnvoiced  -0.076 | avgDurPause      -0.088 | f3STD             -0.11 | f2STDf1           0.101 | f1STD            -0.086
f3meanf1          0.075 | intensityMean     0.085 | intensityMean      0.11 | Quantifiers       0.092 | wpsec             0.085
Relativity        0.074 | nod               0.085 | f1STD            -0.108 | intensitySD       0.092 | Adverbs           0.081
Positive emotion -0.073 | f1STD             -0.08 | percentUnvoiced  -0.107 | f3meanf1          0.091 | I                 -0.08
nod               0.069 | Prepositions      0.078 | PercentBreaks    -0.099 | f3STD            -0.085 | shimmer          -0.077
PercentBreaks    -0.067 | Positive emotion -0.077 | intensitySD       0.092 | smile             0.085 | fmean3            0.077
maxDurPause      -0.066 | f3meanf1          0.077 | f2STDf1           0.092 | Cognitive         0.083 | percentUnvoiced  -0.073
f1STD            -0.065 | Quantifiers       0.075 | wc                0.089 | upsec             0.083 | PercentBreaks    -0.071
Prepositions      0.063 | wc                0.074 | Adverbs           0.081 | percentUnvoiced  -0.079 | max pitch         0.071
intensityMean     0.061 | max pitch          0.07 | f3meanf1          0.081 | PercentBreaks    -0.075 | avgBand1          -0.07
f2STDf1            0.06 | uc                 0.07 | Cognitive          0.08 | max pitch         0.074 | nod                0.07
uc                0.059 | Articles          0.069 | f2meanf1          0.078 | f2meanf1           0.07 | Sadness           0.069
f2meanf1          0.058 | maxDurPause      -0.069 | avgBand2         -0.074 | Adverbs           0.069 | Cognitive         0.064